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Abstract. RNA-Seq is rapidly becoming the standard technology for transcriptome 
analysis. Fundamental to many of the applications of RNA-Seq is the quantification 
problem, which is the accurate measurement of relative transcript abundances from 
the sequenced reads. We focus on this problem, and review many recently published 
models that are used to estimate the relative abundances. In addition to describing 
the models and the different approaches to inference, we also explain how methods 
are related to each other. A key result is that we show how inference with many of 
the models results in identical estimates of relative abundances, even though model 
formulations can be very different. In fact, we are able to show how a single general 
model captures many of the elements of previously published methods. We also review 
the applications of RNA-Seq models to differential analysis, and explain why accurate 
relative transcript abundance estimates are crucial for downstream analyses. 



1. Introduction 

The direct sequencing of transcripts, known as RNA-Seq |1T] , has turned out to have 
many apphcations beyond those of expression arrays. These include genome annotation 
[9], comprehensive identification of fusions in cancer [55], discovery of novel isoforms of 
genes [57J, and even genome sequence assembly |40j. Moreover, RNA-Seq resolution is 
at the level of individual isoforms of genes [61] and can be used to probe single cells [58] . 
The wide variety of applications of RNA-Seq [HZI require the solution of many bioinfor- 
matics problems drawing on methods from mathematics, computer science and statistics 
[IS]. The primary challenges are read mapping [BIl], transcriptome assembly quan- 
tification of relative transcript abundances from mapped fragment counts (the topic of 
this review) and identification of statistically significant changes in relative transcript 
abundances when comparing different experiments [H]. At this point, it is therefore 
difficult, if not impossible, to survey the entire scope of work that constitutes RNA-Seq 
analysis in a single review. However among the many challenges there is one that is of 
singular importance: the accurate quantification of relative transcript abundances. The 
hope is that RNA-Seq will be more accurate than previous technologies, such as microar- 
rays or even qRT-PCR, for inferring relative transcript abundances [3Z], and ultimately 
the success of RNA-Seq hinges on its ability to deliver accurate abundance estimates. 

We therefore focus on the problem of relative transcript abundance quantification in 
this review and begin with a remark about what it means to quantify abundances with 
RNA-Seq data. 
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Remark 1 (Meaning of quantification for RNA-Seq). Since RNA-Seq consists of se- 
quencing RNA (rather than protein), the technology does not measure what is techni- 
cally gene expression. The term "expression" refers to the process by which functional 
products are generated from genes, and although in some cases a gene may consist of a 
non-coding RNA, the abundance of protein coding genes is mediated by translation [19]. 
Even in the case where polyA selection is performed to enrich for mRNA that will be 
translated, what is measured in RNA-Seq are the relative amounts of RNA transcripts. 

It is also important to note that RNA-Seq does not allow for the measurement of 
absolute transcript abundances. Because molecules are sampled proportionately, it can 
only be used to infer relative transcript abundances. 

We are able to describe currently used methods in terms of a single common framework 
that explains how they are all related. Our main result, outlined in Section 2, is the 
observation that the models underlying existing methods can be viewed as special cases 
(or close approximations) of a single model described in [52j. In Section 3 we describe the 
simplest class of models known as "count based models" . In Section 4 we focus on models 
for the estimation of individual relative transcript abundances, and introduce a recurring 
theme which is the equivalence of multinomial and Poisson log-linear models with respect 
to maximum likelihood computations. We continue in Section 5 by describing a general 
model for RNA-Seq analysis that specializes to many previously published models. Next, 
in Section 6, we discuss inference and parameter estimation in RNA-Seq models and in 
Section 7 we examine the applications of such estimates to the comparison of relative 
transcript abundances from two or more experiments. We conclude in Section 8 with a 
discussion and comments on speculations about future developments. 

We have strived to minimize and simplify notation wherever possible, yet have had 
to resort to defining many variables and parameters. To assist the reader. Appendix II 
contains a glossary of notation used. 

2. The RNA-Seq model hierarchy 

Stochastic models of RNA-Seq experiments underlie all methods for obtaining relative 
transcript abundance estimates. In some cases, the underlying models are only implicitly 
described. For example, this is the case in "count-based" methods, where the total 
number of reads mapping to a region (normalized by length and total number of reads 
in the experiment) is used as a proxy for abundance. As we show in the next section, 
this intuitive measure is based on an underlying model that is multinomial, and the 
normalized counts can be understood as the maximum likelihood estimate of parameters 
corresponding to relative transcript abundances based on the model. 

We have organized the published models for RNA-Seq analysis and display the rela- 
tionships among them in Figure [TJ Each node in the graph corresponds to a model for 
RNA-Seq, and models are nested so that if there is a descending path from one node to 
another, then the latter model is a special case of the former. In other words, the figure 
shows a partially ordered set depicting relationships among models. Nodes are labeled 
by published methods (first author+year+citation), and the shaded ellipses describe the 
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features modeled. The complexity of a model corresponds to the number of ellipses it is 
contained in. Features modeled include: 

• count based models: the node at the bottom, contained only in the "single-end 
uniquely mappable reads" ellipse, refers to single end reads (i.e. not paired- 
end reads) models in which all transcripts have a single isoform and reads are 
uniquely mappable to transcripts. In the simplest instance of such models there 
is no modeling of bias. 

• multi- reads (isoform resolution): these are models for individual relative tran- 
script abundances in the case where reads may be sequenced from transcripts in 
genes with multiple isoforms. Equivalently, these are models for "multi-reads" 
which are reads that map to more than one transcript (not necessarily from the 
same gene). The first such model was proposed in [69] and is equivalent to later 
formulations in [Sni US, El] . 

• paired-end reads: these are models that include specific parameters for the length 
distribution of fragments. This is relevant when considering paired-end data in 
which the reads that form mate-pairs correspond to the ends of fragments that 
are being sequenced. Such models require estimation of fragment length distri- 
butions. The first paired-end model was published in [61j and they subsequently 
appeared in [HI [221 SSI ED 

• positional bias: this refers to the non- uniformity of fragments along transcripts, 
and has been hypothesized to be the result of non-uniform fragmentation during 
library preparation. It was first modeled in [311 El HE] and later in [68j but only 
for single-end reads. The model in [31] was extended to a paired-end model in 

m- 

• sequence bias: it has been empirically observed that sequences around the begin- 
ning and end of fragments are non-random leading to hypotheses that priming 
and fragmentation strategies bias fragments |[16j. Models with parameters for 
sequence bias are [521 E2]- In [S2] both sequence bias and positional bias are 
modeled. It should be noted that sequence bias has been modeled indirectly, as 
in [39] using GC content as a proxy (see [52] for more on the connection). 

• In addition to the modeling of the specific features/effects discussed above, mod- 
eling of errors in reads is also discussed in some papers, e.g. [59l |3T]. It is not 
explicitly mentioned in Figure [T] because some models incorporate it implicitly in 
an ad hoc way by filtering during the mapping step (not discussed in this review). 

Remark 2 (Mathematical equivalence of models). Figure [l] is more than just a cartoon 
organizing the models. Models from papers that appear in the same box are mathe- 
matically identical in terms of the quantification results they will produce (although not 
necessarily when used for differential analysis, see Section 7). We expand on this remark 
in the following sections. However, it is is important to note that although models may 
be equivalent, programs implementing them may not be. Implementation details are 
important and non-trivial and can result in programs with drastically different perfor- 
mance results and usabihty. For example, careful attention to data types and processing 
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Figure 1 . Models for RNA-Seq. The figure sliows a Venn diagram (and 
the partially ordered set it induces) representing relationships among mod- 
els. More general models are nested inside simpler models. Models in the 
same box are mathematically equivalent and the boxes are organized so 
that more complex models (i.e., with more parameters) are positioned 
above simpler models with fewer parameters. Model types are color coded 
and the dashed line separates multinomial, Poisson and generalized Pois- 
son models and negative binomial models from normal linear models. This 
is because the MLE obtained using the normal linear models in the multi- 
read case approximates the MLE obtained using the Poisson or multino- 
mial models. Two models |6l US] are connected with dashed lines to [16j 
because they include normalization steps that are related to sequence bias 
correction, but they are not strictly special cases of [16]. Some models, 
such as pn [59] . include terms for modeling errors in mapping that in 
principle can be adapted to model any biases/features (meaning that the 
models are in some sense the most general possible) but if applied in that 
way would be impractical (see Remark [?]) which is why they are not at 
the top of the hierarchy. 
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of reads in a manner tailored to the RNA-Seq protocol and sequencing technology used 
can improve quantification results. 



3. Count based models 

In this section we consider models that assume that all reads are single-end and that 
they map uniquely to transcripts. Such models are also known as "count based models" . 
In the simplest (generative) version of such a model, the transcriptome consists of a set 
of transcripts with different abundances, and a read is produced by choosing a site in 
a transcript for the beginning of the read uniformly at random from among all of the 
positions in the transcriptome. 

Formally, if T is the set of transcripts (with lengths lt,t G T) we define p = {pt}tGT 
to be the relative abundances of the transcripts, so that YlteT ~ ^- denote by F 
the set of (single-end) reads and let Ft C F be the set of reads mapping to transcript t. 
Furthermore, we use assume that all the reads in F have the same length m. Note that 
in transcript t, the number of positions in which a read can start is It = k — rn + 1. The 
adjusted length It is called the effective length of t. 

In the generative model, first a transcript is chosen from which to select a read / by 

(1) P(/Gt) = 



Next, a position in that transcript is selected uniformly at random from among the 
It — m + 1 positions. Thus, the likelihood of observing the reads F as a function of the 
parameters p is 



(2) c{p) = nn 

teT feFt 



Pth 



^^r£T Pr^r 



If we denote by Xt the number of reads mapping to transcript t, i.e. Xt = \Ft\, then we 
can rewrite the likelihood function as 

Xt 

(3) c{p) = n ' 



pth 1 



teT 



There is a convenient change of variables that reveals Equation |3] to be that of a log- 
linear model. Let a = {at}t&T denote the probabilities of selecting a read from the 
different transcripts: 

(4) at := P(/Gt) = 



Note that 

(5) at = 1 and a* > for all t. 



teT 
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Moreover, by Lemma 14 in the Supplementary Material of [^1], for any probability dis- 
tribution {at}teT (i-e., satisfying the conditions of Equationjs]), we can find a probability 
distribution p so that Equation |4] is satisfied by setting 



qt 
It 



(6) Pt = 

It follows that if Equation [3] is rewritten as 



Xt 

at 



oc I I af\ 



(7) C{a) = n 

n 

then the unique maximum likehhood solution for a can be transformed to the (unique) 
maximum likelihood solution for p. That is, from 

(9) = i 

where N = ^t^-pXt is the total number of mapped reads, we can apply ([6| to obtain 
that 



(10) Pt = 
(11) 

(12) oc 

(13) oc 
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Equation 13 is the RPKM (reads per kilobase per millions of reads mapped) formula with 
which to measure abundance from (with the exception that in [H] It is used instead 
of It). Note that RPKM can be viewed as a method because the term abbreviates the 



procedure of evaluating Equation [T3j However the derivation above shows that RPKM is 
better thought of as a unit with which to measure relative transcript abundance because 
it is (up to a scalar factor) the maximum likelihood estimate for the p. Moreover, the 
statistical derivation of relative abundance in RPKM units reveals that effective length 
should be used instead of length, and it is evident that abundance estimates reported in 
RPKM units are not absolute, but rather relative. 

Remark 3 (Normalizing the total number of reads). The number N used has been 
defined to be the number of mapped reads however we note that one can replace it 
by the number of sequenced reads if an extra faux "noise" transcript is included in 
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the analysis as the source of the unmappable reads [21] • We use the term "noise" 
because unmappable reads may represent sequencing mistakes that result in meaningless 
data. Also, additional normalization steps may be applied to improve the robustness of 
relative transcript abundance estimates because extensive transcription of even a single 
gene can drastically affect RPKM values. To address this, quantile normalization was 
proposed in [7] and is implemented in a number of software packages for RNA-Seq 
analysis, e.g.[Tl |26l [61] . 

The model described above is suitable for single isoform genes and is therefore appro- 
priate in organisms such as bacteria where there is no splicing. However in higher Eu- 
karyotes with extensive alternative splicing, alternative promoters, and possibly multiple 
polyadenylation sites, there may be ambiguity in the assignment of reads to transcripts 
and more complex models are necessary. Nevertheless, Equation [3] has been used for 
inference where reads map to multiple isoforms of a gene by using one of two differ- 
ent approaches that we refer to as projective normalization and restriction to uniquely 
raappable reads. 

Projective normalization is an approach to estimating the relative abundance of genes 
consisting of multiple transcripts (corresponding to different isoforms) directly from the 
total number of reads mapping to the gene locus. The approach is based on Equation 



13 but with Xt replaced by the total number of reads mapping to the gene, and It re- 



placed by the total length of all the transcripts comprising the gene after projection into 
genomic coordinates (i.e., the union of all transcribed bases as represented in genomic co- 
ordinates). This approach has been used in a number of methods, including p3l |211 H9]. 
In [nH Proposition 3, Supplementary Material] it is proved that projective normaliza- 
tion always underestimates relative gene abundances, and in [66] empirical evidence 
is provided demonstrating that estimation of individual relative transcript abundances 
(by maximum likelihood) improves the accuracy of relative transcript abundance esti- 
mates. That is, the type of model presented in the next section improves on projective 
normalization. 

Restriction to uniquely mappable reads consists of only utilizing reads that map 
uniquely to features of interest. Such an approach can be used for abundance estima- 
tion in multiple isoform genes by identifying unique features of transcripts (e.g. splice 
junctions unique to an isoform) and applying Equation 13 where Xt is replaced by the 
read counts for that feature and It is suitably adjusted for the length of the feature. 
Restriction to uniquely mappable reads has the major problem that valuable data may 
be omitted from analysis because it cannot be mapped to a unique transcript feature. 
For example, in many transcripts, the unique feature may consist of a single junction. 
The approach of [2H] consists of the restriction to uniquely mappable reads via an ad- 
justment of the length It in Equation 13 whereas in [SH] the correction is performed 
through adjustments to coverage. 

Remark 4 (Species abundance estimation in metagenomics is related to transcriptome 
analysis). The single end read model discussed above is relevant to abundance estimation 
in metagenomics, where the problem of estimating relative abundances of genomes in 
a community is related to relative transcript abundance estimation of single isoform 



8 



Lior Pachter 



genes. For example, a formula equivalent to Equation 13 appears in [2]. Another closely 
related problem is that of inferring haplotype frequencies from pooled samples [M]. 
These analogies between metagenomics, haplotype inference and transcriptomics can be 
extended further. For example, de novo transcriptome assembly requires the assembly of 
related sequences as in metagenome assembly. One key difference between metagenome 
and transcriptome assembly is that sequences from related species are more divergent 
than sequences of isoforms from a single gene. 

4. Models for multi-reads: estimating isoform abundances 

As discussed in the previous section, RNA-Seq allows for the estimation of the relative 
abundance of transcripts, however, this requires an extension of Equation |3] to allow for 
ambiguously mapped reads. In this section we show that the model of [69] is equivalent 
(with the exception of a length factor) to the model of [20], and show that the basis of 
the equivalence applies to many other models. 

We begin by considering the likelihood function in [69] because that paper is, to 
our knowledge, the first publication of the inference model needed for transcript level 
relative abundance estimation from RNA-Seq. In ^\ inference of relative abundances 
using expressed sequence tags (ESTs) is discussed, but in terms of the model there is no 
difference between ESTs and RNA-Seq with the exception of assumptions about whether 
EST counts scale with the length of transcripts, an issue we comment on below. The 
likelihood function that is derived describes the probability of obtaining reads from a 
transcriptome with K transcripts. Using the notation of [69j, we let Y = {yi,k}iLi k=i be 
the matrix defined by t/i^k = 1 if read i aligns to transcript k and otherwise. This matrix 




is called the compatibility matrix (for an example see Equation 38). The likelihood is 
then shown to be 

(14) C{a) = 

Here a = (ai, . . . , ax) is defined as in the previous section (in [UH] «fe is denoted by pk) 
and k G {1, . . . , K} indexes the transcripts of the gene. We note that the denominator 
Ik (length of isoform k) is missing in [69], which can be interpreted as an assumption 
that the frequency of ESTs from a transcript is directly proportional to its abundance 
(independent of the length of the transcript). This assumption makes sense based on 
the oligo(DT) priming strategy for ESTs, however some papers have suggested that 
truncated cDNAs contribute substantially to ESTs due to internal priming [42J. If this 
is the case then the addition of the length in the denominator is warranted; in the next 



section we derive a more general model of which Equation 14 is a special case, and in 



that derivation the reason for the denominator will become apparent. 

Remark 5 (Length normalization). The (deliberate) omission or missed inclusion of the 
denominator Ik in [69j probably did not matter much in practice because the denominator 
is not needed if the lengths of all isoforms are equal. In that case, the likelihood function 
is changed by a scalar factor and therefore the maximum likelihood estimates for the 
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incorrect and correct models will be the same. Since abundance estimates were only 
evaluated qualitatively in the presence or absence of the denominator may not have 
changed results much. On the other hand, the paper |16] is specifically about RNA-Seq 
and therefore the omission of the denominator is an error. 

To describe the model of [20], we first introduce notation needed for multi-reads, which 
are reads that may map to multiple transcripts. Note that there are two reasons why 
reads may map to multiple transcripts: first, parts of transcripts may overlap in genomic 
coordinates (in the case of multi-isoform genes) and secondly, gene families may result 
in duplicated segments throughout the genome that lead to ambiguous read mappings 
(see Figure [2]). In this section we restrict ourselves to the ambiguous mappings resulting 
from multiple isoforms of a single gene (to be consistent with the likelihood formulation 
of [69]), but we introduce notation for the general case as it will be useful later. 

Let T = {{t,i) : t & T,i E {1, . . . ,lt}} be the set of all transcript positions. We 
first define a symmetric relation on the set of positions by the relation {t,i)R{u, j) if 
there exists some fragment / e F so that the 3' end of / aligns to both {t,i) and {u,j). 
Note that if we allow for some mismatches in alignments, the relation R may not be an 
equivalence relation. We will require the properties of equivalence relations in specifying 
models so we replace R by its equivalence closure (also called the transitive closure) and 
denote the resulting equivalence relation by ~. 

The set of all equivalence classes is the quotient of T by ~ and we denote it by 
U = T / ~. Given an equivalence class s G f/, -F, C F is the set of all fragments aligning 
to some element in s (note that this induces an equivalence relation on fragments). For 
simplicity, we also assume that every fragment with a 3' alignment to some transcript t 
has only one alignment of its 5' end to a location in t. 

In [20], it is assumed that for a set of aligned fragments F, for every s E U the number 
of reads starting at s is Poisson distributed with rate parameter 



K 



(15) = E 



k=l 



1^ 

Cs,k ~ ; 

Ik 



where is a rate parameter for transcript k. Here C = {cs^k}k=i s&u ^ site-transcript 
compatibility matrix with Cg ^ = 1 if transcript k appears in some element of s, and 
otherwise. The parameters are the expected number of reads from each transcript k. 
If the observed number of reads starting at s is X^, then the likelihood (Equation 2 in 
[20] ) is given by 

_ / p — ^s \Xs 

(16) c{k) = n 



's 



At first glance Equations 14 and 16 look very different. The underlying models are 
distinct not just in name but in the generative model for RNA-Seq that they imply. In 
the Poisson model counts are Poisson distributed, which is only approximately the same 
as the count distribution from the multinomial model. It is therefore no surprise that 
the likelihood functions for the two different models have a different form. Moreover, 
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neither the parameters a nor p appear in Equation 16 However, we will see that 



Equation 16 can be written in terms of parameters equivalent to the a and that the two 
likelihood functions are maximized at the same value. This follows from an elementary 
and well-known equivalence between multinomial and Poisson log-linear models [25j. It 
is discussed in the RNA-Seq case in [51] but for completeness and clarity, we review the 
connection between the models in detail below. 



Proposition 1. Maximum likelihood 'parameters obtained via Equations 14_ and 16_ lead 
to exactly the same relative transcript abundance estimates. 

Proof: Note that Yl,s&-^s (from [20]) = N (from [69]) by definition. We will also 
define parameters 13k := j^, Z = Ylik=il^k and 7^ = ^. Note that Yl!k=i'^k = 1 and 
7fc > 0, however since the Kk and Pk are unconstrained the factor Z can, a priori be 
arbitrary. The proof is based on the result that the maximum likelihood values for n are 
obtained when Z = 1. We derive this after proving a simple combinatorial lemma that 
is useful inside the main argument. The lemma states that each Kk is counted once when 
summing over the equivalence classes of positions (and suitably normalizing by length): 

Lemma 1. 



K K 



k=l si^U k=l ^ 



Proof: 



K K 

(17) = J]5^c,,fc^ 

s&U k=l '•^ k=l sGU ''^ 



K 

l-k 



Ik 



(18) = Y.~^ 

k=l 
K 

(19) = □ 



From this it follows that 



(20) EE^^.^r = '^^^^' 

seu k=i ''^ k=i 



'k 
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which is used in the proof of the proposition: 

(21) c{K) = n 

(22) 



(23) oc e 



(24) = e-^^-^^nlE^^-^'^r 

s& \k=l ''^ 




AT 




(25) oc 



(26) oc 

A key observation is that the parameter Z in the left parentheses can be maximized 
independently of the 7^ in the right parentheses. This is because Z is unconstrained and 
the constraints on 7 do not involve Z (the 7 only need to be non-negative and must sum 
to 1). The maximum likelihood estimate for the normalization constant Z is Z = 1 and 
therefore (3k = and the maximization of Equation [16] is equivalent to maximizing 

(27) m 

where = 1 and > 0. The interpretation of the /3 parameters in the Poisson 

model [20] is equivalent to the interpretation of the a parameters in [69], i.e. they are the 
probabilities for choosing reads from transcripts. Therefore, both model formulations 
are equivalent. □ 




5. A GENERAL MODEL FOR RNA-SeQ 

In this section we present a general model for RNA-Seq that includes as special cases 
the other models discussed in this review. As before, we assume that every fragment 
with a 3' alignment to some transcript t, has only one alignment of its 5' end to a location 
in t. This means that the notion of the length of a fragment relative to a transcript is 
well-defined, and we denote this length by lt{f)- This assumption is slightly restrictive, 
but simplifies a lot of the notation. 

In a generative description of the model the 3' end of a fragment is selected first. The 
probability of selecting the 3' end from a specific site within a transcript depends on 
the abundance of the site relative to others (as determined by the relative abundance 
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Figure 2. Notation used in describing RNA-Seq models. In this exam- 
ple there are three transcripts (^1,^2, ^3) whose abundances are denoted 
by ptj^, pt^, pt.^. Fragment lengths are distributed according to D, and the 
length of a transcript t is denoted by It- The effective length It is an 
adjustment of the length taking into account bias and fragment length 
constraints. Note that hiif) is the length of the alignment of fragment 
/ to transcript ^2, and may be different from hiif)- In the example, 
transcripts ^2 and ts overlap in genomic coordinates. Three positions 
{ti,i), {t2,j), (ts, k) have been selected, one from each transcript, that lie 
in the same equivalence class (i.e. cannot be distinguished in mapping). 
This is indicated by (ti,i) ~ (t2,j) ~ {hjk). Note that (t2,j) ~ (^^3) ^) 
because they overlap in genomic coordinates whereas in the case of {ti,i) 
the transcript is part of a different gene. A fragment / is shown mapping 
to transcript ^2- 



parameters p), on the local sequence content, and also on the relative position of the 
site within the transcript. Then a length for the fragment is selected according to a 
distribution D and again according to the local sequence content. 

We are interested in estimating the pt, but due to the other parameters specifying a 
sequencing experiment we infer them indirectly. To specify the probability of a fragment 
with specific 5' and 3' ends, we require the following parameters: 

• The fragment length distribution denoted D which is a distribution whose sup- 
port is the positive integers. 

• Site specific bias: parameters U(^t,i) denoting the 3' site specific bias for position i 
in transcript t. These parameter are non-negative real numbers (or "weights"), 
so that no bias corresponds to all U(^t,i) = 1- Similarly, the 5' site specific bias for 
position i in transcript t is denoted by f(t,i). 
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• Positional bias: parameters where a; G [0, 1] that are non- negative real num- 
bers. 

• Errors in reads: parameters Ctj denote the probability of observing the sequence 
in / assuming that it was produced from transcript t. We assume (for simplic- 
ity) that every fragment could have been generated from at most one position 
in each transcript. The error probabilities Ctj can be estimated according to 
an error model for sequencing and based on the number and locations of mis- 
matches/indels between / and t. Note that Ctj can be viewed as a generalization 
of Ui^k from the previous section where i is replaced by / and khj t and the values 
form a probability distribution rather than being restricted to the set {0, 1}. 

Our model specifies that the probability of selecting a fragment / with 3' end {t, i) given 
that / originates from transcript t is 

Ej— 1 Dli—j) 

(28) P(r = (t,OI/et) = -^''^^ ^ ^ — , 

where 

(29) h = I] l^f -^(M) -^Z^^lS^W^^^CtJ) 

As in Section 3, It is the effective length of transcript t. The probability of selecting a 
fragment / with 3' end [t, i) is now given by 

(30) ;i(,,,):=P(/3' = (t,,)) = p(/et)P(f' = (t,0|/et) 

V^i-l D(i-j) 

(31) = ^ fc) ^ 

It 

and the conditional probability that a fragment / with 3' end {t,i) has length lt{f) is 

(32) C(i) := mf)=lmf' = it,^)) = /))^(^--;+^) . 
Therefore, the probability of generating a fragment / of length lt{f) with 3' end {t, i) is 

(33) P(/3' = (t, ,),/(/) = /,(/)) = 

(34) = at ~ . 

h 

The generative model we have just described is summarized in Figure |3} 
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Figure 3. A generative (graphical) model for RNA-Seq. 

We can now derive the likelihood for a set of fragments: 
(35) C{a) = n n E «*e,/P(/-^' = (t,0,/(/) = ltU)\f e t) 



s£U feFs (t,i)Gs 



Li) 



(36) = n n E -"('.oc 

seu feFs {t,i)&s 

7et Mitres ltY.k=iD{i-k) 

This is a linear model (for fixed bias and error parameters) that is concave. In the 
next section we discuss how maximum likelihood estimates are obtained. Note that 
the inferred parameters a can be translated to maximum likelihood estimates for the 
relative abundances p using Equation [6j Estimates are usually reported in units of 
FPKM (fragments per kilobase per million mapped fragments) [61] and are a scalar 



multiple of the p as in Equation 13 



In the case when bias is not modeled, i.e. ui^t,i) = V(^t,i) = 'W{t,i) = 1, then Equation 37 
reduces to Equation 9 in the Supplementary Methods of j61j (with the slight difference 
in model as explained in Remark [6]). This is also the model in [31] (with pt named Ot) 
and where the parameters W{t,i) are in the model but U{t,i) = V(^t,i) = 1- The paper [31] 
describes a single read model and the paired-end case will appear in [30]. This model 
is also the one used in [16] {at is named Pt, and formulated as an equivalent Poisson 
model). 

Remark 6 (Directional assymetry in RNA-Seq). There is some confusion in the current 
literature about how to model the generation of paired-end fragments in RNA-Seq ex- 
periments (e.g. in [12] three "strategies" are proposed). Some models consider a process 
where the 5' end of a read is generated first (usually uniformly at random) followed by 
the 3' end according to the fragment length distribution (normalized appropriately if the 
5' site is close to the 3' end of the transcript). In the model described above, we have 
preferred to assume that the 3' site is generated first, because that more closely mimics 
the actual protocol. Alternatively, one can assume that first a length is chosen according 
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to the fragment length distribution, and then the 5' and 3' sites are chosen [52] ■ Such a 
model may better capture the fact that size selection follows reverse transcription and 
double stranded cDNA generation. It is at present unclear which formulation produces 
best estimates, but regardless of the choice the effect on the likelihood function is to 
slightly alter the form of the denominator. 

Remark 7 (Errors in reads). The error model we have included (parameters etj) is 
analogous to the formulation in [59] and allows for a general and position-specific mod- 
eling of errors. However, it should be noted that there is a connection between errors 
and read mapping that can skew results even when errors are modeled. A read with a 
number of errors beyond the threshold of the mapping program used is not mapped, and 
this missing data problem is not addressed in our model. This issue can affect expression 
estimates, especially in the case of allele specific estimation [8] (see also remark below). 
It is therefore advisable to correct errors before mapping, using methods such as in [70] . 
In [59j all possible mapping are considered, i.e. the entire read-transcript compatibil- 
ity matrix is constructed, and while this is the most general model for RNA-Seq since 
in principle the "error" model can capture any feature, it is impractical because the 
compatibility matrix is too large to work with explicitly in practical examples. 

Remark 8 (Allele specific estimates). It may be desirable to infer relative transcript 
abundances for individual haplotypes and this problem is addressed in various methods 
papers [HI ESI SSI SSI [S2]- It is possible using the formalism we have described by simply 
doubling the number of transcripts (one for each haplotype) and utilizing the error 
parameters etj to obtain probabilities for each fragment originating from each of the 
haplotypes. Note that if the haplotypes are unknown then heterozygous sites can be 
inferred from the mapped fragments, but we do not consider that problem in this review 
as it is a mapping issue rather than a modeling problem. 

The model we have presented is multinomial, however as discussed previously it can 
be formulated as an equivalent Poisson model. The details, and a proof that the two 
formulations are equivalent, is provided in Appendix I. 

6. Inference 

The single read single isoform model described in Section 2 is log-linear and therefore 
admits a closed form solution. Models that allow for ambiguous read mapping are no 
longer log-linear, but they do have nice properties and, assuming that bias effects are 
known, are concave [20l |6T] . This means that numerical algorithms can be used to find 
the (unique) global maximum assuming that the model is identifiable. 

Remark 9 (Identifiability). Identifiability of RNA-Seq models is addressed in [2H HT] 
and appears even earlier in the computational biology literature in related models used in 
other applications (e.g., [17]). Identifiability is the statistical property of "inference being 
possible". Mathematically, that means that different parameter values (relative tran- 
script abundances) generate different probability distributions on read counts. Testing 
for identifiability in our setting is equivalent to determining whether the compatibility 
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matrix (Section 4) is full rank [T7j. The identifiability problem is related to the transcript 
assembly problem, because with certain assumptions assemblies can be guaranteed to 
generate identifiable models for the aligned reads (this is the case with Cufflinks assem- 
blies [6TJ). 

Typically the Expectation-Maximization (EM) algorithm is used for optimization be- 
cause of its simplicity in formulation and implementation. 

Example 1. Consider a gene with three transcripts of equal length labeled red, green 
and blue, and with 5 single-end reads aligning to the transcripts according to the con- 
figuration shown in Figure |4j If the reads are labeled a, b, c, d, e then the compatibility 
matrix (see Section 4) is 

a b c d e 
red /l 1 1 1 

(38) Y = green 110 1 

blue yi 1 1 

If the transcripts have abundances Pred' Pblue' Pgreen then the estimation of the p 
(using the model in Section 3) is the mathematics problem of maximizing the likelihood 
function 

(39) C{p) = (pred + Pblue) (Pred + Pgreen) (Pblue + Pgreen)Pred 

subject to the constraint Pj-ed + Pblue + Pgreen = 1- First, we note that the matrix Y 
is full rank (rank=3) and therefore the model is identifiable. 

Proposition 2. The maximum likelihood solution is given by 

(40) Pblue = Pgreen = ~m 

Proof: For notational simplicity we let x = Pblue> — Pgreen, z = Pxed- The goal is 
to maximize f{x,y,z) = z{l — x){l — y){l — z) subject to the constraint that x + y + z = 1 
and x,y,z >0. Note that f{x,y,z) = {x + y){l — {x + y)){l + xy — (x + y)). For any value 
X + y at which / is maximized, it follows that xy must be maximized conditional on 
the sum, and that occurs at x = y. Therefore, the problem is equivalent to maximizing 
2x(l — 2x)(l + — 2x) where < x < ^. The maximum is at x = ^ — □ 

In the EM algorithm, relative transcript abundances are first initialized (e.g. to the 
uniform distribution). In the expectation step every read (or fragment in the case 
of paired-end sequencing) is proportionately assigned to the transcripts it is compati- 
ble with, according to the relative transcript abundances. Then, in the maximization 
step, relative abundances are recalculated using the (proportionately) assigned fragment 



counts. In other words, the maximization step is the application of Equation 13 where 
counts are replaced by expected counts. A key property of the EM algorithm is that the 
likelihood increases at every step [45]. The illustration in Figure |4] shows 2 iterations of 
the algorithm and one can observe the convergence of p with Pbiue — 0.33, 0.27, 0.23, .... 
EM theory states that Pblue converge to the value in ( [40| ). 
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Figure 4. Illustration of the EM algorithm. The gene has three isoforms 
(red, green, blue) of the same length. There are five reads (a,b,c,d,e) 
mapping to the gene. One maps to all three isoforms, one only to red, and 
the other three to each of the three pairs of isoforms. Initially every isoform 
is assigned the same abundance (^, |, |). During the expectation (E) step 
reads are proportionately assigned to transcripts according to the isoform 
abundances. Next, during the maximization (M) step isoform abundances 
are recalculated from the proportionately assigned read counts. Thus, for 
example, the abundance of red after the first M step is estimated by 
0.47 = (0.33 + 0.5 + 1 + 0.5)/(2.33 + 1.33 + 1.33). 
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A derivation of the EM algorithm in the case of the model just described (from the 
general EM algorithm) is detailed in [55] . 

Remark 10 (Rescue for multi-reads). The "rescue" method for multi-reads [H] is equiv- 
alent to one step of the EM algorithm (this important observation was made in |31j). 

In some cases least squares has been used instead of the EM algorithm, e.g. in 
O [121 [IB]- Least squares estimates are equivalent to maximum likelihood estimates 
under the assumption that the counts are normally distributed (with equal variances 
in [5], Poisson derived variances in [12] and possibly with a more complicated variance- 
covariance structure in [18j requiring an additional estimation procedure). Multivariate 
normal models approximate multinomial models well when counts are high but are not 
suitable when counts are low. There can be a computational problem with least squares 
inference. For example, the problem formulation in [i8\ results in an algorithm requir- 
ing matrix inversions of sizes exponential in the number of transcripts. Furthermore, 
the least squares approach requires constrained optimization (quadratic programming) 
because relative transcript abundances must be non-negative. This is done in |5],|12] but 
not in [I8l where instead a heuristic is used and negative estimated values are truncated 
to zero. 

The method in [3H] is similar to the least squares approaches described above, except 
that Li minimization is performed. Unfortunately, the details of the method are not 
sufficiently well explained to permit a direct comparison to other published methods and 
therefore it has been omitted from Figure [ij 

The inference approaches of [201 [221 [61] differ from other published methods in that 
they are Bayesian rather than frequentist. In [22], instead of using maximum likeli- 
hood to infer p, the posterior distribution is computed assuming a Dirichlet prior. The 
approach of [20] and [61j is to use a Gaussian prior which is chosen according to the 
MLE estimated using the EM algorithm (with a variance-covariance matrix based on 
the inverse Fisher information matrix from asymptotic MLE theory). An importance 
sampling procedure is then used to sample from the posterior distribution ^3]. 

7. Differential analysis 

Accurate relative transcript abundance estimates are important not just for compar- 
ing isoform abundances to each other, but also for many other RNA-Seq applications. 
A common question for which RNA-Seq is now used is to determine differentially tran- 
scribed RNAs (the term differential expression is frequently used but it is a misnomer 
and the term differential analysis is more accurate). Clearly accurate relative tran- 
script abundance estimates are desirable in order to have power to detect differentially 
transcribed RNAs. 

First we note that the multinomial models and Poisson models are equivalent in terms 
of differential analysis by virtue of the following elementary lemma: 

Lemma 2. Suppose Xi, . . . , X„ are Poisson distributed with rates Ai, . . . , respectively. 
Then the distribution ofXi,..., X„ conditioned on the sum Y17=i Xi = N is multinomial. 
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These models, however have been found to be unsatisfactory for differential analysis 
because observed counts in technical, and even more so, biological replicates, do not 
behave multinomially. This is a common phenomenon observed in count-based exper- 
iments, and is referred to as over- or under- dispersion. Papers such as [U |53l EEl [65] 
have proposed numerous alternatives to the multinomial model, for example assuming 
instead that counts are distributed according to a negative binomial, or generalized Pois- 
son distribution. The differences between the methods [U [53l [65] are mainly in how they 
estimate parameters, and a thorough discussion of how they differ is beyond the scope 
of this review. A key point, however, is that these approaches all focus on single-end 
uniquely mappable reads (i.e., the models in Section 3) and this has the major drawback 
of not allowing for differential analysis of individual transcripts in multi isoform genes, 
and of biasing results in gene families. Recent versions of the Cufflinks software [61] 
address this problem. 

Remark 11 (Models, quantification and differential analysis). We can now summarize 
the relevance of model selection for quantification and differential analysis as follows: 
Multinomial models and Poisson models are equivalent for quantification and for dif- 
ferential analysis (by virtue of Lemma [2]). Negative binomial models are equivalent to 
multinomial models for quantification, but result in different (more conservative) results 
in differential analysis. Negative binomial models will produce different relative abun- 
dance estimates if generalized to the multi-read case but this has not yet been done. 
Normal linear models in the multi-read case result in different quantification and differ- 
ential analysis results although they can be viewed as an approximation of the Poisson 
(or multinomial) models. 

Finally, we note that optimized differential analysis of RNA-Seq experiments involves 
not only the selection of appropriate models and robust parameters estimation techniques 
but also appropriate experimental design [3]. 

8. Discussion and future directions 

Models of RNA-Seq have advanced greatly in complexity and accuracy in the three 
years since the protocol was developed; in this review alone we have discussed more 
than 30 different methods that have been published. At the same time, there has been 
considerable progress in RNA-Seq technology, both in protocol development to reduce 
bias |29], and in sequencing technology that has resulted in much higher throughput 
|36j . These developments have led to remarkable progress in the accuracy of RNA-Seq 
based relative transcript abundance estimates in the short time since introduction of the 
technology. Eventually, as single molecule based technologies mature |50J and reads as 
long as entire transcripts can be produced [TU], RNA-Seq will be "solved" in the sense 
that a single sequencing experiment will directly reveal the transcriptome being queried. 

However with present technologies, and for the foreseeable future, improvements in 
RNA-Seq modeling are required to best utilize the data produced in RNA sequencing 
experiments. Furthermore, related technologies such as ribosomal profiling [19] require 
similar models because they rely on probabilistically assigning short reads to transcripts. 
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yet they are limited by experimental rather than technological issues and the types of 
models we have described are therefore likely to be relevant for a long time. 

A key question, in light of the many remarks we have made, is what models are relevant 
in practice, and how a model should be selected in conjunction with available sequencing 
technology. We begin with what we believe is the most important consideration: 

Remark 12 (Read length and model). The multi-read models presented in Section 3 
that are suitable for reads which may map to multiple transcripts are essential for accu- 
rate quantification and differential analysis, even in organisms without splicing or with 
few multi-isoform genes. This is because of multi-gene families, and repeated domains, 
that can result in ambiguously mapped reads. This issue is particularly pronounced for 
short reads (< 50'bp) [621 Table 1]. Longer reads and fragments mitigate multi-mapping 
issues due to gene duplicates, however in that case effective length corrections (Section 
2) become very important. Use of Equation 
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with It in the denominator instead of It 
can affect relative abundance estimates by up to 30% for fragments with an average size 
of 200 (as is common in current protocols). It is important to note that the improvement 
in mapping of longer reads may come at the expense of the number of reads, and in [21] 
it is shown that more short reads are better for accurate quantification than fewer long 
reads. However, it should also be noted that for other RNA-Seq analysis problems, e.g. 
transcriptome assembly, long reads and fragments are crucial [61] . 



The simplest models discussed in this review assume uniformity of fragment location 
across transcripts, yet observed data does not conform to this assumption (see, e.g. [TT| 
Figure 6]). It is therefore highly desirable to model bias using more general models, and 
results in recent papers such as [52J show better agreement between RNA-Seq and qRT- 
PCR based estimates when bias is taken into account. Even newer protocols result in 
biases [29] that can be mitigated via appropriate modeling. However even with current 
state of the art corrections for sequence and positional bias, unexplained biases in the 
data continue to be observed [321 [52]. A possible source of bias that remains to be 
explored and may affect RNA-Seq experiments is secondary structure in RNA, and the 
effect it may have on the various fragmentation and reverse transcription steps of existing 
protocols. Recent progress on sequencing based assays for measuring structural features 
of RNA molecules may help to establish a connection between structure and bias, and 
to quantify the effect if it exists [H EH] [351 ES] . 

A crucial area where progress is needed is in the assessment of accuracy of RNA-Seq. 
The exact accuracy of relative abundance estimates based on the techniques reviewed 
in this paper is currently unknown, and benchmarks with respect to qRT-PCR [7J or 
nanostring [52] have cast doubt on whether those technologies are more accurate than 
RNA-Seq and suitable as "gold standards". Most importantly, there is a need for sys- 
tematic benchmarks where abundances are known a priori, yet where experiments mimic 
the complexities of in vivo transcriptomes. 

Another aspect of RNA-Seq that has yet to be fully explored is the connection between 
relative abundance estimation and transcriptome assembly. In [61], it is shown that an 
incomplete transcriptome can bias relative abundance estimates. For this reason, it was 
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proposed that analyses should be performed with respect to an assembly generated from 
the data, rather than using a "reference" annotation. Indeed, in almost every RNA-Seq 
study performed to date many novel transcripts have been detected, even in extensively 
annotated species such as Drosophila [13]. Transcriptome assembly is therefore crucial 
at the present time for accurate relative abundance estimation. 

Accurate and complete transcriptome assembly is, in turn, dependent on the ability 
to accurately quantify relative transcript abundances. This is because fragment lengths 
are currently much shorter than transcript lengths, and therefore local estimates of 
relative abundance are the only information available for phasing distant exons during 
assembly [M]- For this reason, statistical assembly approaches such as need to 
be developed for transcriptome assembly. Despite initial work by a number of groups 
(personal communication), the problem of how to perform statistical assembly efficiently 
and accurately remains open. 
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Appendix I: Equivalence between the paired-end Poisson and 

MULTINOMIAL MODELS FOR RNA-SeQ 

In this section we show that the general model described in Section 5 is equivalent 
(in the sense discussed in Section 4) to a Poisson model for paired-end reads (with the 
same bias model). 

To describe the Poisson model we need some extra notation: a weak composition of 
n into k parts is an ordered tuple of non- negative integers c = (ci, . . . ,Cfc) such that 
Ci + ■ ■ ■ + Cfc = ra. If c is a weak composition of n we write c \- n. We note that the 
number of weak compositions of n into k parts is 



(41) 



n + k — 1\ fn + k 

k-1 I ~ [ n 



For convenience, we use the notation |c| for the number of parts in a composition. 
We will make use of multinomial coefficients, and we use the convention that 

/ n\ f ^ \ ^' 

\c) \ci,...,Ck) Ci!c2!---Cfc! 

where c h n = (ci, . . . , Cfc) is a composition of n into k parts. 

Finally, we note that a function P : Fs ^ s induces a weak composition of \Fs\ where 
the composition is given by the cardinalities of the sets {/ : P{f) = (t, i)} as (t, i) ranges 
over the elements of s. We use \P\ to denote this composition. 

Example 2. Suppose that S = {(ti, 4), (^2, 6), (ts, 3)} is an equivalence class with 3 
elements corresponding to three distinct positions in three different transcripts (ti, ^2, ^s)- 
Suppose that the set of fragments whose 3' ends align to 5* consists of four fragments: 
Fs = {/i,/2,/3,/4}. There are 3^ = 81 different functions P : -> S. If P(/i) = 
-^(/s) = (tiA) and P(/2) = P(/4) = (^3,3) then P induces the weak composition c h 4 
with 3 parts given by c = (2, 0, 2). We denote this weak composition c by \P\. 

We conclude by highlighting an elementary, yet crucial step in the derivation of the 
likelihood function. Suppose that {aij}^f^j^^ are n ■ m indeterminates. Then the fol- 
lowing two polynomials are equal: 

n n m 

(43) n^^'/w = nsz^^i- 

/:[n]->[m] «=1 «=1 j=l 



The expression on the left in (43) consists of m" monomials, each with n terms, but the 
factored expression on the right shows that the polynomial can be evaluated using only 
0{nm) operations. 

Next we turn to the likelihood function. Let s eU = T/~ and let c h IF^I be a weak 
composition of \Fs\. We define rjs^c to be the sum 

(44) r]s,c = J2 H ^kffP(f)J- 

P:Fs^s:\P\=c /SF, 
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The number t^^.c is the probabihty of observing the assignment of fragments in Fg to 
elements of s where the numbers of fragments originating from each {t, i) G s is given 
by c(t,i). 

Our model for RNA-Seq is Poisson, which means that we assume that the number of 
fragments ending at a given site is Poisson distributed. Specifically, we assume that the 
number of fragments with 3' end (t, i) is 



(45) \t,'i) = 



Ei—l D(i—j) 



where the notation is the same as that in Section 5. As in Section 4, we let /3j = ^. The 
likelihood function we now derive directly generalizes that of [20] to paired-end reads. 
The likelihood function is now given by 



(46) ^ - n E 7^ n 



A' 



Ht.i) 



sec/ \cl-|F»|:|c| = |s| V c ; \{t,i)& 



\t.r)_M. 



seU \ cl-|F,|:|c| = |s| \{t,i)es 



(48) « n n En ^'^>^s 

seU \{t,i)es I \cy\Fs\:\c\=\s\(t,i)^a 

(49) = J]e-^(M).^^*'W I E nCV^(/)jM/) 

sef/ \ch|F,|:|c| = |s| \P:F,^s:|P|=c/eF, 



(50) 



(51) = He-^'*'^)-"'-^ n E 

(«) xfnnEAe. /'"'-'''"-"7^^s^ 

\s& f&Fs {t,i)(is 

(54) ^ .-i:..^. X f n n E 
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We note that the term in parentheses in (54) is exactly the same expression as (37) with 



at replaced by (3f As in the derivation in Section 3, it is easy to see that the likelihood 
functions are maximized at d = /3. 

Appendix II: Notation 

The notation table below shows the names of all the variables we use. It is divided 
into parts using four divisions: the first is notation for structures independent of an ex- 
periment (transcripts). The second is notation for data from the experiment (fragments 
and their alignments). The third is notation for the parameters of the model. The fourth 
consists of helper variables that are calculated from the primary variables in the first 
three categories. 

T set of transcripts 

T set of transcripts together with their positions 

It length of transcript t 

F set of fragments 

Fs The set of fragments aligning to an equivalence class s of alignment positions 

lt{f) the length of a fragment alignment to a transcript t 

Xt number of fragments aligning to a transcript t 

Xs number of fragments ending (or starting) at a site in s 

Y compatibility matrix (between fragments and transcripts) 

C compatibility matrix (between sites and transcripts) 



Pt abundance of transcript t 

X(^t,i) Poisson rate parameter for position i in transcript t 

at probability of selecting a fragment from transcript t in the multinomial model 

Kt Poisson rate parameter for transcript t & T 

Pt The probability of selecting a fragment from transcript t in the Poisson model 

etj probability of observing / given the sequence in t 

U(t,i) 3' bias weight 

V{t,i) 5' bias weight 

Wx positional weight where x E [0,1] 

D Fragment length distribution 

It effective length of transcript t 



rjs^c Probability that fragments were sequenced from position i of transcript t 
(^^■■^ Probability of selecting a fragment of length kif) given 3' end (t, i) 



Table 1. Notation. 
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